Now that we've seen how to use scikit-learn, statsmodels and LIME for modelling, it's time for you to apply this knowledge. This week has three challenges.
Each challenge has three components:
Some important notes for the challenges:
We are constantly monitoring the issues on GitHub to help you out. Don't hesitate to log an issue there, explaining clearly what the problem is, showing the code you are using, and including any error message you may be receiving.
Important: We are only monitoring the repository on weekdays, from 9.30 to 17.00. Issues logged outside these hours will most likely be answered the next day. This means you should not wait for our response before submitting a challenge :-)
In DA5 and 6, we will work with a simulated dataset from an online store that ran a set of campaigns. This dataset will be used to predict whether someone made a purchase (i.e., purchase) and, if so, how much they spent (i.e., order_euros). These predictions will later be used for targeting users. For example, the marketing manager wants to identify users with a high likelihood of making a purchase, in order to target advertising campaigns that increase their awareness of the store, and to identify users who are likely to spend more than 100 euros on the store, in order to give them discounts that other users would not receive.
These predictions will be made about the dependent variables (or targets): purchase and order_euros. Your independent variables (or features) will be relevant characteristics available on (or created from) the datasets below.
You will find three files in this folder (starting with da5-6):
- da56_orders.csv.gz (or da56_orders.pkl.gz for a pickle version). This file contains a list of orders made on the online store: the session_id that made the order, whether there was a purchase, and how many euros were spent in that order.
- da56_sessions.csv.gz (or da56_sessions.pkl.gz for a pickle version). This file contains a list of online sessions on the website. For each session, it contains the details of the browser that visited the site (user_agent), when the visit started, what the referral was, and whether the referral was part of a paid campaign. If a user had a profile when making the purchase, this file also contains their user id.
- da56_users.jsonl. This file contains a list of registered users on the website: their name, when they created their profile, their age, the referral that they followed when creating their profile, and whether they are part of an exclusive club of preferential clients.

In this first challenge, you will load, inspect, clean and visualize your own dataset by:
Tip: If you decide to use the da56_users dataset, you may end up with several missing values (for users that entered the website but did not have a profile, or made the purchase checking out as guest). You can decide whether to fill these missing values and keep the cases, if appropriate, or simply to work with a smaller dataset (only for registered users). Both ways are fine but, whatever you do, make sure to explain what you are doing and justify your choices.
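To make the two options in the tip concrete, here is a minimal sketch (using a hypothetical toy frame, not the actual course data) of filling missing values versus restricting the data to registered users:

```python
import pandas as pd

# Hypothetical toy frame standing in for the merged session/user data:
# rows with NaN user_id are guest sessions without a profile
df = pd.DataFrame({
    "user_id": ["u1", None, "u3", None],
    "age": [21.0, None, 64.0, None],
    "purchase": [1.0, None, 1.0, None],
})

# Option A: keep all sessions, filling the missing target
# (a missing `purchase` means no order was recorded for that session)
filled = df.copy()
filled["purchase"] = filled["purchase"].fillna(0)

# Option B: restrict the analysis to registered users only
registered = df.dropna(subset=["user_id"])

print(len(filled), len(registered))  # 4 2
```

Whichever route you take, state it explicitly in your summary and justify the choice.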
Write a technical and a business interpretation for the creation of the dataset.
Use Markdown formatting to make the summary clear and visually appealing, and constantly refer to the output of the code to substantiate your claims. If needed, you can interleave the code with the summary (e.g., start the summary in Markdown, run the relevant code in the next cell, and interpret the output using Markdown in the cell after that).
In sections 1 and 2 of their article, Tràmer et al. (2017) discuss how unfair treatment can arise from irresponsible data usage, and provide a few examples. You are now working with digital trace data from a (fictitious) online store, and will later create models that will be used to target campaigns or give discounts to consumers. Please briefly discuss how this may create a risk of unfair treatment (and indicate specifically what that unfair treatment would be, and why). Moving one step further, look at the variables in the original datasets (da56_orders, da56_sessions and da56_users). Select the one that you believe has the most potential for creating risks of unfair treatment, and explain why.
import pandas as pd
import seaborn as sns
%matplotlib inline
orders = pd.read_pickle('da56_orders.pkl.gz')
orders.columns
Index(['session_id', 'order_euros', 'purchase'], dtype='object')
orders.dtypes
session_id       int64
order_euros    float64
purchase       float64
dtype: object
orders.isna().sum()
session_id     0
order_euros    0
purchase       0
dtype: int64
orders.head()
| | session_id | order_euros | purchase |
|---|---|---|---|
| 0 | 5555694754 | 150.0 | 1.0 |
| 11 | 5555694765 | 150.0 | 1.0 |
| 18 | 5555694772 | 150.0 | 1.0 |
| 19 | 5555694773 | 239.0 | 1.0 |
| 20 | 5555694774 | 150.0 | 1.0 |
orders[['order_euros', 'purchase']].describe()
| | order_euros | purchase |
|---|---|---|
| count | 11339.000000 | 11339.0 |
| mean | 178.913661 | 1.0 |
| std | 54.820250 | 0.0 |
| min | 110.500000 | 1.0 |
| 25% | 150.000000 | 1.0 |
| 50% | 150.000000 | 1.0 |
| 75% | 208.000000 | 1.0 |
| max | 487.000000 | 1.0 |
sessions = pd.read_pickle('da56_sessions.pkl.gz')
sessions.columns
Index(['session_id', 'session_timestamp', 'user_agent', 'referral',
'paid_campaign', 'user_id'],
dtype='object')
sessions.dtypes
session_id                   object
session_timestamp    datetime64[ns]
user_agent                   object
referral                     object
paid_campaign               float64
user_id                      object
dtype: object
sessions.isna().sum()
session_id               0
session_timestamp        0
user_agent               0
referral                 0
paid_campaign        21569
user_id              34877
dtype: int64
sessions.head()
| | session_id | session_timestamp | user_agent | referral | paid_campaign | user_id |
|---|---|---|---|---|---|---|
| 0 | 5555694754 | 2021-09-20 06:59:09 | Mozilla/5.0 (Windows NT 6.2; lo-LA; rv:1.9.1.2... | google.com | 1.0 | NaN |
| 1 | 5555694755 | 2021-09-23 19:07:17 | Mozilla/5.0 (Windows; U; Windows NT 6.0) Apple... | instagram.com | 4.0 | NaN |
| 2 | 5555694756 | 2021-09-25 14:12:23 | Mozilla/5.0 (Android 2.3.6; Mobile; rv:7.0) Ge... | instagram.com | NaN | NaN |
| 3 | 5555694757 | 2021-09-20 11:12:36 | Mozilla/5.0 (compatible; MSIE 5.0; Windows NT ... | google.com | NaN | NaN |
| 4 | 5555694758 | 2021-09-24 11:12:22 | Opera/9.52.(X11; Linux i686; bho-IN) Presto/2.... | google.com | 2.0 | NaN |
users = pd.read_json('da56_users.jsonl', lines=True)
users.columns
Index(['user_info'], dtype='object')
users.head()
| | user_info |
|---|---|
| 0 | {'id': '55885858', 'reg_name': 'Melissa Hanson... |
| 1 | {'id': '55885859', 'reg_name': 'Danielle Evans... |
| 2 | {'id': '55885860', 'reg_name': 'Erika Horton',... |
| 3 | {'id': '55885861', 'reg_name': 'Nicole Campbel... |
| 4 | {'id': '55885862', 'reg_name': 'Jessica Sanche... |
users = pd.json_normalize(users['user_info'])
users.head()
| | id | reg_name | age | registration_date | initial_referrer | preferential_client |
|---|---|---|---|---|---|---|
| 0 | 55885858 | Melissa Hanson | 21 | 2019-10-29 | instagram.com | NaN |
| 1 | 55885859 | Danielle Evans | 20 | 2019-11-08 | instagram.com | NaN |
| 2 | 55885860 | Erika Horton | 18 | 2021-01-27 | google.com | NaN |
| 3 | 55885861 | Nicole Campbell | 46 | 2020-06-21 | google.com | NaN |
| 4 | 55885862 | Jessica Sanchez | 64 | 2020-02-15 | massey.com | NaN |
users.columns
Index(['id', 'reg_name', 'age', 'registration_date', 'initial_referrer',
'preferential_client'],
dtype='object')
users.isna().sum()
id                       0
reg_name                 0
age                      0
registration_date        0
initial_referrer         0
preferential_client    831
dtype: int64
users.dtypes
id                      object
reg_name                object
age                      int64
registration_date       object
initial_referrer        object
preferential_client    float64
dtype: object
len(users)
1000
Partially using the user information:
Alternatives:
RQ1: To what extent does coming from a paid campaign influence the likelihood of a purchase?

H1: Users coming from a paid campaign are more likely to purchase than users not coming from a paid campaign.

RQ2: To what extent does coming from a paid campaign influence the amount spent (order_euros)?

H2: Users coming from a paid campaign will spend more than users not coming from a paid campaign.
orders.dtypes
session_id       int64
order_euros    float64
purchase       float64
dtype: object
sessions.dtypes
session_id                   object
session_timestamp    datetime64[ns]
user_agent                   object
referral                     object
paid_campaign               float64
user_id                      object
dtype: object
# Cast session_id to numeric so it matches the int64 key in `orders` before merging
sessions['session_id'] = pd.to_numeric(sessions['session_id'])
data = sessions.merge(orders, on='session_id', how='left')
len(sessions), len(orders), len(data)
(50000, 11339, 50000)
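Comparing the lengths before and after the merge is one sanity check; pandas can also verify the merge directly. A minimal sketch (with hypothetical toy frames, not the course data) using `indicator=True` to count matched rows and `validate=` to catch duplicate keys:

```python
import pandas as pd

# Toy frames mirroring the shapes above: every session at most one order
sessions = pd.DataFrame({"session_id": [1, 2, 3, 4]})
orders = pd.DataFrame({"session_id": [1, 3],
                       "order_euros": [150.0, 239.0],
                       "purchase": [1.0, 1.0]})

# indicator=True adds a `_merge` column flagging which side each row came from;
# validate="one_to_one" raises MergeError if session_id is duplicated on either side
data = sessions.merge(orders, on="session_id", how="left",
                      indicator=True, validate="one_to_one")

print(data["_merge"].value_counts())
```

With the real data, the count of `both` should equal the number of orders (11,339) and `left_only` the sessions without an order.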
data.head()
| | session_id | session_timestamp | user_agent | referral | paid_campaign | user_id | order_euros | purchase |
|---|---|---|---|---|---|---|---|---|
| 0 | 5555694754 | 2021-09-20 06:59:09 | Mozilla/5.0 (Windows NT 6.2; lo-LA; rv:1.9.1.2... | google.com | 1.0 | NaN | 150.0 | 1.0 |
| 1 | 5555694755 | 2021-09-23 19:07:17 | Mozilla/5.0 (Windows; U; Windows NT 6.0) Apple... | instagram.com | 4.0 | NaN | NaN | NaN |
| 2 | 5555694756 | 2021-09-25 14:12:23 | Mozilla/5.0 (Android 2.3.6; Mobile; rv:7.0) Ge... | instagram.com | NaN | NaN | NaN | NaN |
| 3 | 5555694757 | 2021-09-20 11:12:36 | Mozilla/5.0 (compatible; MSIE 5.0; Windows NT ... | google.com | NaN | NaN | NaN | NaN |
| 4 | 5555694758 | 2021-09-24 11:12:22 | Opera/9.52.(X11; Linux i686; bho-IN) Presto/2.... | google.com | 2.0 | NaN | NaN | NaN |
data.dtypes
session_id                    int64
session_timestamp    datetime64[ns]
user_agent                   object
referral                     object
paid_campaign               float64
user_id                      object
order_euros                 float64
purchase                    float64
dtype: object
users.dtypes
id                      object
reg_name                object
age                      int64
registration_date       object
initial_referrer        object
preferential_client    float64
dtype: object
users = users.rename(columns={'id': 'user_id'})
users.dtypes
user_id                 object
reg_name                object
age                      int64
registration_date       object
initial_referrer        object
preferential_client    float64
dtype: object
data_users = data.merge(users, on='user_id', how='left')
len(data_users), len(data), len(users)
(50000, 50000, 1000)
data_users.head()
| | session_id | session_timestamp | user_agent | referral | paid_campaign | user_id | order_euros | purchase | reg_name | age | registration_date | initial_referrer | preferential_client |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5555694754 | 2021-09-20 06:59:09 | Mozilla/5.0 (Windows NT 6.2; lo-LA; rv:1.9.1.2... | google.com | 1.0 | NaN | 150.0 | 1.0 | NaN | NaN | NaN | NaN | NaN |
| 1 | 5555694755 | 2021-09-23 19:07:17 | Mozilla/5.0 (Windows; U; Windows NT 6.0) Apple... | instagram.com | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 5555694756 | 2021-09-25 14:12:23 | Mozilla/5.0 (Android 2.3.6; Mobile; rv:7.0) Ge... | instagram.com | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 5555694757 | 2021-09-20 11:12:36 | Mozilla/5.0 (compatible; MSIE 5.0; Windows NT ... | google.com | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 5555694758 | 2021-09-24 11:12:22 | Opera/9.52.(X11; Linux i686; bho-IN) Presto/2.... | google.com | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
data_users.isna().sum()
session_id                 0
session_timestamp          0
user_agent                 0
referral                   0
paid_campaign          21569
user_id                34877
order_euros            38661
purchase               38661
reg_name               34877
age                    34877
registration_date      34877
initial_referrer       34877
preferential_client    47488
dtype: int64
Dependent variables:
Independent variable(s):
# Note: selecting a subset of columns like this returns a view; adding .copy()
# here would avoid the SettingWithCopyWarning raised by the fillna() assignments below
df = data_users[['session_id', 'order_euros', 'purchase', 'paid_campaign',
                 'preferential_client', 'referral', 'user_agent']]
df.head()
| | session_id | order_euros | purchase | paid_campaign | preferential_client | referral | user_agent |
|---|---|---|---|---|---|---|---|
| 0 | 5555694754 | 150.0 | 1.0 | 1.0 | NaN | google.com | Mozilla/5.0 (Windows NT 6.2; lo-LA; rv:1.9.1.2... |
| 1 | 5555694755 | NaN | NaN | 4.0 | NaN | instagram.com | Mozilla/5.0 (Windows; U; Windows NT 6.0) Apple... |
| 2 | 5555694756 | NaN | NaN | NaN | NaN | instagram.com | Mozilla/5.0 (Android 2.3.6; Mobile; rv:7.0) Ge... |
| 3 | 5555694757 | NaN | NaN | NaN | NaN | google.com | Mozilla/5.0 (compatible; MSIE 5.0; Windows NT ... |
| 4 | 5555694758 | NaN | NaN | 2.0 | NaN | google.com | Opera/9.52.(X11; Linux i686; bho-IN) Presto/2.... |
df.isna().sum()
session_id                 0
order_euros            38661
purchase               38661
paid_campaign          21569
preferential_client    47488
referral                   0
user_agent                 0
dtype: int64
df['order_euros'].describe()
count    11339.000000
mean       178.913661
std         54.820250
min        110.500000
25%        150.000000
50%        150.000000
75%        208.000000
max        487.000000
Name: order_euros, dtype: float64
df['order_euros'] = df['order_euros'].fillna(0)
/var/folders/vf/g8sc1dyd5tb46q752y6jl2k40000gn/T/ipykernel_9043/1938906212.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['order_euros'] = df['order_euros'].fillna(0)
df['order_euros'].describe()
count    50000.000000
mean        40.574040
std         79.338489
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max        487.000000
Name: order_euros, dtype: float64
df['purchase'].value_counts()
1.0    11339
Name: purchase, dtype: int64
df['purchase'] = df['purchase'].fillna(0)
df['purchase'].value_counts()
0.0    38661
1.0    11339
Name: purchase, dtype: int64
df['paid_campaign'].value_counts()
1.0    7179
3.0    7157
4.0    7121
2.0    6974
Name: paid_campaign, dtype: int64
df['paid_campaign'] = df['paid_campaign'].fillna(0)
df['paid_campaign'].value_counts()
0.0    21569
1.0     7179
3.0     7157
4.0     7121
2.0     6974
Name: paid_campaign, dtype: int64
df['preferential_client'].value_counts()
1.0    2512
Name: preferential_client, dtype: int64
df['preferential_client'] = df['preferential_client'].fillna(0)
df['preferential_client'].value_counts()
0.0    47488
1.0     2512
Name: preferential_client, dtype: int64
df.head()
| | session_id | order_euros | purchase | paid_campaign | preferential_client | referral | user_agent |
|---|---|---|---|---|---|---|---|
| 0 | 5555694754 | 150.0 | 1.0 | 1.0 | 0.0 | google.com | Mozilla/5.0 (Windows NT 6.2; lo-LA; rv:1.9.1.2... |
| 1 | 5555694755 | 0.0 | 0.0 | 4.0 | 0.0 | instagram.com | Mozilla/5.0 (Windows; U; Windows NT 6.0) Apple... |
| 2 | 5555694756 | 0.0 | 0.0 | 0.0 | 0.0 | instagram.com | Mozilla/5.0 (Android 2.3.6; Mobile; rv:7.0) Ge... |
| 3 | 5555694757 | 0.0 | 0.0 | 0.0 | 0.0 | google.com | Mozilla/5.0 (compatible; MSIE 5.0; Windows NT ... |
| 4 | 5555694758 | 0.0 | 0.0 | 2.0 | 0.0 | google.com | Opera/9.52.(X11; Linux i686; bho-IN) Presto/2.... |
df['referral'].value_counts()
google.com 11168
instagram.com 11136
facebook.com 10938
smith.com 135
johnson.com 104
...
pitts.org 1
cunningham-wilson.com 1
williams-holloway.com 1
brown-stone.info 1
hodge.org 1
Name: referral, Length: 8630, dtype: int64
def parse_variables(row):
# Handling paid campaign
row['paid_campaign_binary'] = 0
if row['paid_campaign'] > 0:
row['paid_campaign_binary'] = 1
row['paid_campaign_1'] = 0
row['paid_campaign_2'] = 0
row['paid_campaign_3'] = 0
row['paid_campaign_4'] = 0
if row['paid_campaign'] == 1:
row['paid_campaign_1'] = 1
if row['paid_campaign'] == 2:
row['paid_campaign_2'] = 1
if row['paid_campaign'] == 3:
row['paid_campaign_3'] = 1
if row['paid_campaign'] == 4:
row['paid_campaign_4'] = 1
# Handling referral
if 'google' in str(row['referral']).lower():
row['referral_google'] = 1
else:
row['referral_google'] = 0
if 'instagram' in str(row['referral']).lower():
row['referral_instagram'] = 1
else:
row['referral_instagram'] = 0
if 'facebook' in str(row['referral']).lower():
row['referral_facebook'] = 1
else:
row['referral_facebook'] = 0
# Handling user agent
if 'mac' in str(row['user_agent']).lower():
row['user_agent_mac'] = 1
else:
row['user_agent_mac'] = 0
if 'windows' in str(row['user_agent']).lower():
row['user_agent_windows'] = 1
else:
row['user_agent_windows'] = 0
return row
df.iloc[0]
session_id                                                    5555694754
order_euros                                                        150.0
purchase                                                             1.0
paid_campaign                                                        1.0
preferential_client                                                  0.0
referral                                                      google.com
user_agent             Mozilla/5.0 (Windows NT 6.2; lo-LA; rv:1.9.1.2...
Name: 0, dtype: object
parse_variables(df.iloc[0])
session_id                                                    5555694754
order_euros                                                        150.0
purchase                                                             1.0
paid_campaign                                                        1.0
preferential_client                                                  0.0
referral                                                      google.com
user_agent             Mozilla/5.0 (Windows NT 6.2; lo-LA; rv:1.9.1.2...
paid_campaign_binary                                                   1
paid_campaign_1                                                        1
paid_campaign_2                                                        0
paid_campaign_3                                                        0
paid_campaign_4                                                        0
referral_google                                                        1
referral_instagram                                                     0
referral_facebook                                                      0
user_agent_mac                                                         0
user_agent_windows                                                     1
Name: 0, dtype: object
parse_variables(df.iloc[100])
session_id                                                    5555694854
order_euros                                                          0.0
purchase                                                             0.0
paid_campaign                                                        1.0
preferential_client                                                  0.0
referral                                                    facebook.com
user_agent             Opera/9.60.(Windows 95; lij-IT) Presto/2.9.188...
paid_campaign_binary                                                   1
paid_campaign_1                                                        1
paid_campaign_2                                                        0
paid_campaign_3                                                        0
paid_campaign_4                                                        0
referral_google                                                        0
referral_instagram                                                     0
referral_facebook                                                      1
user_agent_mac                                                         0
user_agent_windows                                                     1
Name: 100, dtype: object
df = df.apply(parse_variables, axis=1)
df.groupby(['paid_campaign_binary', 'paid_campaign'])['session_id'].count()
paid_campaign_binary paid_campaign
0 0.0 21569
1 1.0 7179
2.0 6974
3.0 7157
4.0 7121
Name: session_id, dtype: int64
df.groupby(['paid_campaign_1', 'paid_campaign'])['session_id'].count()
paid_campaign_1 paid_campaign
0 0.0 21569
2.0 6974
3.0 7157
4.0 7121
1 1.0 7179
Name: session_id, dtype: int64
df.groupby(['paid_campaign_2', 'paid_campaign'])['session_id'].count()
paid_campaign_2 paid_campaign
0 0.0 21569
1.0 7179
3.0 7157
4.0 7121
1 2.0 6974
Name: session_id, dtype: int64
df.groupby(['paid_campaign_3', 'paid_campaign'])['session_id'].count()
paid_campaign_3 paid_campaign
0 0.0 21569
1.0 7179
2.0 6974
4.0 7121
1 3.0 7157
Name: session_id, dtype: int64
df.groupby(['paid_campaign_4', 'paid_campaign'])['session_id'].count()
paid_campaign_4 paid_campaign
0 0.0 21569
1.0 7179
2.0 6974
3.0 7157
1 4.0 7121
Name: session_id, dtype: int64
df[df['referral_google']==1]['referral'].value_counts()
google.com    11168
Name: referral, dtype: int64
df[df['referral_google']==0]['referral'].value_counts()
instagram.com 11136
facebook.com 10938
smith.com 135
johnson.com 104
williams.com 103
...
gonzales-brown.com 1
miller-graves.com 1
stuart-walker.info 1
keller-washington.org 1
hodge.org 1
Name: referral, Length: 8629, dtype: int64
df[df['referral_instagram']==1]['referral'].value_counts()
instagram.com    11136
Name: referral, dtype: int64
df[df['referral_instagram']==0]['referral'].value_counts()
google.com 11168
facebook.com 10938
smith.com 135
johnson.com 104
williams.com 103
...
gonzales-brown.com 1
miller-graves.com 1
stuart-walker.info 1
keller-washington.org 1
hodge.org 1
Name: referral, Length: 8629, dtype: int64
df[df['referral_facebook']==1]['referral'].value_counts()
facebook.com    10938
Name: referral, dtype: int64
df[df['referral_facebook']==0]['referral'].value_counts()
google.com 11168
instagram.com 11136
smith.com 135
johnson.com 104
williams.com 103
...
stuart-walker.info 1
keller-washington.org 1
bailey-blair.info 1
cameron.info 1
hodge.org 1
Name: referral, Length: 8629, dtype: int64
df[df['user_agent_mac']==1]['user_agent'].value_counts()
Mozilla/5.0 (iPod; U; CPU iPhone OS 3_1 like Mac OS X; ru-UA) AppleWebKit/533.7.2 (KHTML, like Gecko) Version/3.0.5 Mobile/8B112 Safari/6533.7.2 1
Mozilla/5.0 (Macintosh; Intel Mac OS X 10 10_9 rv:6.0; ru-UA) AppleWebKit/535.36.1 (KHTML, like Gecko) Version/5.0.5 Safari/535.36.1 1
Mozilla/5.0 (Macintosh; PPC Mac OS X 10 5_9 rv:6.0; ckb-IQ) AppleWebKit/531.1.7 (KHTML, like Gecko) Version/5.0 Safari/531.1.7 1
Mozilla/5.0 (iPad; CPU iPad OS 10_3_3 like Mac OS X) AppleWebKit/533.2 (KHTML, like Gecko) FxiOS/17.9s6334.0 Mobile/71W933 Safari/533.2 1
Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10 10_8) AppleWebKit/534.2 (KHTML, like Gecko) Chrome/44.0.837.0 Safari/534.2 1
..
Mozilla/5.0 (Macintosh; Intel Mac OS X 10 8_1 rv:3.0; gez-ER) AppleWebKit/534.32.7 (KHTML, like Gecko) Version/5.0 Safari/534.32.7 1
Mozilla/5.0 (iPod; U; CPU iPhone OS 4_0 like Mac OS X; th-TH) AppleWebKit/532.5.4 (KHTML, like Gecko) Version/3.0.5 Mobile/8B117 Safari/6532.5.4 1
Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_5 like Mac OS X) AppleWebKit/534.2 (KHTML, like Gecko) FxiOS/11.5t8039.0 Mobile/77N124 Safari/534.2 1
Mozilla/5.0 (iPad; CPU iPad OS 6_1_6 like Mac OS X) AppleWebKit/531.2 (KHTML, like Gecko) CriOS/33.0.837.0 Mobile/08F889 Safari/531.2 1
Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3 like Mac OS X; ce-RU) AppleWebKit/533.50.7 (KHTML, like Gecko) Version/4.0.5 Mobile/8B113 Safari/6533.50.7 1
Name: user_agent, Length: 14778, dtype: int64
df[df['user_agent_mac']==0]['user_agent'].value_counts()
Mozilla/5.0 (compatible; MSIE 8.0; Windows 98; Trident/5.1) 40
Mozilla/5.0 (compatible; MSIE 5.0; Windows NT 5.1; Trident/5.0) 38
Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.2; Trident/4.0) 38
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 5.1; Trident/5.1) 37
Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.01; Trident/5.1) 36
..
Opera/8.41.(Windows CE; lij-IT) Presto/2.9.173 Version/11.00 1
Opera/8.28.(X11; Linux x86_64; kw-GB) Presto/2.9.164 Version/10.00 1
Mozilla/5.0 (Windows NT 6.0) AppleWebKit/532.2 (KHTML, like Gecko) Chrome/55.0.848.0 Safari/532.2 1
Opera/9.99.(Windows NT 10.0; ia-FR) Presto/2.9.165 Version/10.00 1
Opera/8.39.(Windows NT 6.1; sl-SI) Presto/2.9.166 Version/10.00 1
Name: user_agent, Length: 25107, dtype: int64
df[df['user_agent_windows']==1]['user_agent'].value_counts()
Mozilla/5.0 (compatible; MSIE 8.0; Windows 98; Trident/5.1) 40
Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.2; Trident/4.0) 38
Mozilla/5.0 (compatible; MSIE 5.0; Windows NT 5.1; Trident/5.0) 38
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 5.1; Trident/5.1) 37
Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 5.01; Trident/4.1) 36
..
Opera/8.85.(Windows NT 6.0; nb-NO) Presto/2.9.173 Version/12.00 1
Mozilla/5.0 (Windows; U; Windows NT 4.0) AppleWebKit/535.42.5 (KHTML, like Gecko) Version/5.1 Safari/535.42.5 1
Opera/9.30.(Windows NT 5.2; eo-US) Presto/2.9.184 Version/12.00 1
Mozilla/5.0 (Windows; U; Windows NT 6.2) AppleWebKit/535.30.6 (KHTML, like Gecko) Version/5.0.2 Safari/535.30.6 1
Opera/8.39.(Windows NT 6.1; sl-SI) Presto/2.9.166 Version/10.00 1
Name: user_agent, Length: 12626, dtype: int64
df[df['user_agent_windows']==0]['user_agent'].value_counts()
Mozilla/5.0 (Android 2.3.7; Mobile; rv:61.0) Gecko/61.0 Firefox/61.0 5
Mozilla/5.0 (Android 7.1; Mobile; rv:60.0) Gecko/60.0 Firefox/60.0 5
Mozilla/5.0 (Android 4.1.2; Mobile; rv:55.0) Gecko/55.0 Firefox/55.0 4
Mozilla/5.0 (Android 2.2.2; Mobile; rv:22.0) Gecko/22.0 Firefox/22.0 4
Mozilla/5.0 (Android 2.3.6; Mobile; rv:19.0) Gecko/19.0 Firefox/19.0 4
..
Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_6 like Mac OS X) AppleWebKit/533.0 (KHTML, like Gecko) FxiOS/12.0o1808.0 Mobile/63R671 Safari/533.0 1
Mozilla/5.0 (iPhone; CPU iPhone OS 5_1_1 like Mac OS X) AppleWebKit/535.0 (KHTML, like Gecko) FxiOS/10.3t5107.0 Mobile/14Q726 Safari/535.0 1
Mozilla/5.0 (iPod; U; CPU iPhone OS 3_3 like Mac OS X; eo-US) AppleWebKit/533.37.2 (KHTML, like Gecko) Version/4.0.5 Mobile/8B115 Safari/6533.37.2 1
Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/531.2 (KHTML, like Gecko) FxiOS/14.8b4386.0 Mobile/07Z080 Safari/531.2 1
Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10 9_1 rv:2.0; cy-GB) AppleWebKit/535.37.4 (KHTML, like Gecko) Version/5.0.4 Safari/535.37.4 1
Name: user_agent, Length: 27259, dtype: int64
df.head()
| | session_id | order_euros | purchase | paid_campaign | preferential_client | referral | user_agent | paid_campaign_binary | paid_campaign_1 | paid_campaign_2 | paid_campaign_3 | paid_campaign_4 | referral_google | referral_instagram | referral_facebook | user_agent_mac | user_agent_windows |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5555694754 | 150.0 | 1.0 | 1.0 | 0.0 | google.com | Mozilla/5.0 (Windows NT 6.2; lo-LA; rv:1.9.1.2... | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 5555694755 | 0.0 | 0.0 | 4.0 | 0.0 | instagram.com | Mozilla/5.0 (Windows; U; Windows NT 6.0) Apple... | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 2 | 5555694756 | 0.0 | 0.0 | 0.0 | 0.0 | instagram.com | Mozilla/5.0 (Android 2.3.6; Mobile; rv:7.0) Ge... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 5555694757 | 0.0 | 0.0 | 0.0 | 0.0 | google.com | Mozilla/5.0 (compatible; MSIE 5.0; Windows NT ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | 5555694758 | 0.0 | 0.0 | 2.0 | 0.0 | google.com | Opera/9.52.(X11; Linux i686; bho-IN) Presto/2.... | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
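A minimal sketch of how the dummy columns in the table above could have been derived (the exact preprocessing you used in challenge 1 may differ; `demo` and its values are hypothetical stand-ins for the sessions data):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the merged sessions data
demo = pd.DataFrame({
    'paid_campaign': [0.0, 1.0, 2.0, 4.0],
    'referral': ['google.com', 'instagram.com', 'facebook.com', 'google.com'],
})

# Binary flag: did the session come from any paid campaign? (paid_campaign_binary)
demo['paid_campaign_binary'] = (demo['paid_campaign'] > 0).astype(int)

# One dummy per campaign id (paid_campaign_1 ... paid_campaign_4 in the dataset)
campaign_dummies = pd.get_dummies(demo['paid_campaign'].astype(int),
                                  prefix='paid_campaign')

# One dummy per referral source (referral_google, referral_instagram, ...)
referral_dummies = pd.get_dummies(
    demo['referral'].str.replace('.com', '', regex=False), prefix='referral')

demo = pd.concat([demo, campaign_dummies, referral_dummies], axis=1)
```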
Dependent variables:
Independent variable(s):
Now that you have your dataset created (challenge 1), we would like you to focus on the variable purchase (binary variable) as the DV. You need to:
Optional: Use LIME to explain the predictions created by the model, contrasting the importance of each of the features (IV or controls) in the model.
Write a technical and a business interpretation for the creation of the dataset.
Use Markdown formatting to make the summary clear and visually appealing, and constantly refer to the output of the code to substantiate your claims. If needed, you can combine the code with the summary (e.g., start the summary in Markdown, run the relevant code in the next cell, and interpret the output using Markdown in the cell after that).
Hind (2019) discusses four main groups interested in explainable AI. Select the three most relevant groups for this challenge and explain, in the context of this challenge, why they would be interested in understanding the model being used, and what they would want to know about the model in order to trust its predictions.
df.columns
Index(['session_id', 'order_euros', 'purchase', 'paid_campaign',
'preferential_client', 'referral', 'user_agent', 'paid_campaign_binary',
'paid_campaign_1', 'paid_campaign_2', 'paid_campaign_3',
'paid_campaign_4', 'referral_google', 'referral_instagram',
'referral_facebook', 'user_agent_mac', 'user_agent_windows'],
dtype='object')
df.dtypes
session_id                int64
order_euros             float64
purchase                float64
paid_campaign           float64
preferential_client     float64
referral                 object
user_agent               object
paid_campaign_binary      int64
paid_campaign_1           int64
paid_campaign_2           int64
paid_campaign_3           int64
paid_campaign_4           int64
referral_google           int64
referral_instagram        int64
referral_facebook         int64
user_agent_mac            int64
user_agent_windows        int64
dtype: object
RQ1: To what extent does coming from a paid campaign influence the likelihood of a purchase?
H1: Users coming from a paid campaign are more likely to purchase than users not coming from a paid campaign
df.groupby(['paid_campaign','paid_campaign_1'])['session_id'].count()
paid_campaign  paid_campaign_1
0.0            0                  21569
1.0            1                   7179
2.0            0                   6974
3.0            0                   7157
4.0            0                   7121
Name: session_id, dtype: int64
df.groupby(['paid_campaign','paid_campaign_binary'])['session_id'].count()
paid_campaign  paid_campaign_binary
0.0            0                       21569
1.0            1                        7179
2.0            1                        6974
3.0            1                        7157
4.0            1                        7121
Name: session_id, dtype: int64
Nature of the dependent variable: purchase is binary (0 or 1), so we use logistic regression.
df['purchase'].describe()
count    50000.000000
mean         0.226780
std          0.418753
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: purchase, dtype: float64
df['purchase'].value_counts()
0.0    38661
1.0    11339
Name: purchase, dtype: int64
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
import numpy as np
import lime
from lime import lime_tabular
purchase_stat = sm.Logit(df['purchase'],
sm.add_constant(df[['paid_campaign_binary', 'referral_google',
'referral_instagram', 'referral_facebook',
'user_agent_mac', 'user_agent_windows']]))
purchase_results = purchase_stat.fit()
Optimization terminated successfully.
Current function value: 0.389695
Iterations 7
purchase_results.summary()
| Dep. Variable: | purchase | No. Observations: | 50000 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 49993 |
| Method: | MLE | Df Model: | 6 |
| Date: | Sun, 03 Oct 2021 | Pseudo R-squ.: | 0.2721 |
| Time: | 16:21:49 | Log-Likelihood: | -19485. |
| converged: | True | LL-Null: | -26768. |
| Covariance Type: | nonrobust | LLR p-value: | 0.000 |
| | coef | std err | z | P>|z| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -2.7171 | 0.038 | -70.706 | 0.000 | -2.792 | -2.642 |
| paid_campaign_binary | 0.8210 | 0.027 | 30.635 | 0.000 | 0.768 | 0.874 |
| referral_google | 2.2565 | 0.034 | 67.334 | 0.000 | 2.191 | 2.322 |
| referral_instagram | -1.1772 | 0.049 | -24.013 | 0.000 | -1.273 | -1.081 |
| referral_facebook | 0.6662 | 0.035 | 19.277 | 0.000 | 0.598 | 0.734 |
| user_agent_mac | 1.4991 | 0.033 | 45.874 | 0.000 | 1.435 | 1.563 |
| user_agent_windows | -0.7520 | 0.034 | -21.952 | 0.000 | -0.819 | -0.685 |
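Logit coefficients are on the log-odds scale; exponentiating them gives odds ratios, which are usually easier to communicate to business stakeholders. With the fitted results object this is simply `np.exp(purchase_results.summary())`-style arithmetic on `purchase_results.params`; the sketch below instead uses the coefficients copied from the summary table above:

```python
import numpy as np

# Coefficients copied from the Logit summary above (log-odds scale)
coefs = {
    'paid_campaign_binary': 0.8210,
    'referral_google': 2.2565,
    'referral_instagram': -1.1772,
    'referral_facebook': 0.6662,
    'user_agent_mac': 1.4991,
    'user_agent_windows': -0.7520,
}

# Exponentiating a logit coefficient yields an odds ratio: a Google referral
# multiplies the odds of a purchase by roughly 9.5, while an Instagram
# referral cuts them to about 0.31 (less than a third).
odds_ratios = {name: float(np.exp(b)) for name, b in coefs.items()}
```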
logit_clf = LogisticRegression(max_iter=1000, fit_intercept = True)
logit_clf.fit(df[['paid_campaign_binary', 'referral_google',
'referral_instagram', 'referral_facebook',
'user_agent_mac', 'user_agent_windows']], df['purchase'])
LogisticRegression(max_iter=1000)
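Note that the sklearn coefficients will not exactly match the statsmodels ones: `LogisticRegression` applies L2 regularization by default (`C=1.0`). A minimal sketch, on hypothetical toy data rather than the challenge dataset, showing that a very large `C` effectively disables the penalty and brings the fit closer to the unpenalized MLE that `sm.Logit` reports:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary features and outcome (hypothetical; stand-ins for the challenge data)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 2)).astype(float)
logits = -1.0 + 1.5 * X[:, 0] + 0.8 * X[:, 1]
y = (rng.random(400) < 1 / (1 + np.exp(-logits))).astype(int)

# Default fit shrinks coefficients via the L2 penalty (C=1.0);
# a very large C effectively removes the penalty.
default = LogisticRegression(max_iter=1000).fit(X, y)
unpenalized = LogisticRegression(C=1e9, max_iter=1000).fit(X, y)
```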
logit_clf.predict_proba([[0,0,0,0,0,0]])
array([[0.93787015, 0.06212985]])
logit_clf.predict_proba([[1,0,0,0,0,0]])
array([[0.86925987, 0.13074013]])
logit_clf.predict_proba([[0,1,0,0,0,0]])
array([[0.61317359, 0.38682641]])
logit_clf.predict_proba([[0,0,1,0,0,0]])
array([[0.9799683, 0.0200317]])
logit_clf.predict_proba([[0,0,0,0,0,1]])
array([[0.96970812, 0.03029188]])
logit_clf.predict_proba([[1,1,0,0,0,0]])
array([[0.41113328, 0.58886672]])
logit_clf.predict_proba([[1,1,0,0,1,0]])
array([[0.13510515, 0.86489485]])
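As a sanity check, these probabilities can be reproduced (approximately) by passing the linear predictor built from the statsmodels coefficients through the logistic (sigmoid) function; the small gap with sklearn's output comes from its default regularization:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Session with paid_campaign_binary=1, referral_google=1, user_agent_mac=1,
# using the coefficients from the Logit summary above
z = -2.7171 + 0.8210 + 2.2565 + 1.4991
prob = sigmoid(z)
# prob is about 0.865, close to the 0.8649 returned by
# logit_clf.predict_proba([[1,1,0,0,1,0]]) above
```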
data_lime_purchase = df[['paid_campaign_binary', 'referral_google',
'referral_instagram', 'referral_facebook',
'user_agent_mac', 'user_agent_windows','purchase']]
class_names_purchase = data_lime_purchase.columns[:-1]  # feature names only; exclude the DV 'purchase'
X_data_lime_purchase = data_lime_purchase[['paid_campaign_binary', 'referral_google',
                                           'referral_instagram', 'referral_facebook',
                                           'user_agent_mac', 'user_agent_windows']].to_numpy()
y_data_lime_purchase = data_lime_purchase['purchase'].to_numpy()
explainer = lime.lime_tabular.LimeTabularExplainer(
    X_data_lime_purchase,
    feature_names=class_names_purchase,
    verbose=True,
    mode='classification')
print(X_data_lime_purchase[500])
exp = explainer.explain_instance(X_data_lime_purchase[500], logit_clf.predict_proba)
exp.show_in_notebook(show_table=True)
[1 0 1 0 1 0]
Intercept 0.43254461868119976
Prediction_local [0.23579199]
Right: 0.17179235340357854
exp = explainer.explain_instance(np.array([1,1,0,0,1,0]), logit_clf.predict_proba)
exp.show_in_notebook(show_table=True)
Intercept -0.07868311448183393
Prediction_local [0.75449158]
Right: 0.8648948455755271
Now that you have your dataset created (challenge 1), we would like you to focus on the variable order_euros as the DV. You need to:
Optional: Use LIME to explain the predictions created by the model, contrasting the importance of each of the features (IV or controls) in the model.
Write a technical and a business interpretation for the creation of the dataset.
Use Markdown formatting to make the summary clear and visually appealing, and constantly refer to the output of the code to substantiate your claims. If needed, you can combine the code with the summary (e.g., start the summary in Markdown, run the relevant code in the next cell, and interpret the output using Markdown in the cell after that).
Hind (2019) briefly touches upon global and local approaches for model explanations. Looking at the models and predictions you created, review which aspects would fit the idea of global explanations, and which would fit the idea of local explanations. Select at least one example for each, and write a brief local explanation and a brief global explanation for the model. Use one of the three relevant groups identified in the reflection of challenge 2 as your stakeholder for the explanation (and indicate in the reflection who the stakeholder is).
RQ2: To what extent does coming from a paid campaign influence the amount spent (order_euros)?
H2: Users coming from a paid campaign will spend more than users not coming from a paid campaign
Nature of the dependent variable: order_euros is continuous, so we use linear regression (OLS).
df['order_euros'].describe()
count    50000.000000
mean        40.574040
std         79.338489
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max        487.000000
Name: order_euros, dtype: float64
df['order_euros'].value_counts()
0.0 38661
150.0 3310
200.0 594
250.0 328
114.5 76
...
329.5 1
374.5 1
323.5 1
384.5 1
460.5 1
Name: order_euros, Length: 398, dtype: int64
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import numpy as np
import lime
from lime import lime_tabular
euros_stat = sm.OLS(df['order_euros'],
sm.add_constant(df[['paid_campaign_binary', 'referral_google',
'referral_instagram', 'referral_facebook',
'user_agent_mac', 'user_agent_windows']]))
euros_results = euros_stat.fit()
euros_results.summary()
| Dep. Variable: | order_euros | R-squared: | 0.278 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.278 |
| Method: | Least Squares | F-statistic: | 3202. |
| Date: | Sun, 03 Oct 2021 | Prob (F-statistic): | 0.00 |
| Time: | 16:43:46 | Log-Likelihood: | -2.8150e+05 |
| No. Observations: | 50000 | AIC: | 5.630e+05 |
| Df Residuals: | 49993 | BIC: | 5.631e+05 |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>|t| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 6.4295 | 0.807 | 7.967 | 0.000 | 4.848 | 8.011 |
| paid_campaign_binary | 20.9800 | 0.609 | 34.453 | 0.000 | 19.786 | 22.174 |
| referral_google | 71.4277 | 0.824 | 86.708 | 0.000 | 69.813 | 73.042 |
| referral_instagram | -14.9127 | 0.824 | -18.088 | 0.000 | -16.529 | -13.297 |
| referral_facebook | 18.6255 | 0.829 | 22.469 | 0.000 | 17.001 | 20.250 |
| user_agent_mac | 43.2900 | 0.812 | 53.312 | 0.000 | 41.698 | 44.882 |
| user_agent_windows | -16.3470 | 0.745 | -21.930 | 0.000 | -17.808 | -14.886 |
| Omnibus: | 10538.016 | Durbin-Watson: | 2.004 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 20819.338 |
| Skew: | 1.277 | Prob(JB): | 0.00 |
| Kurtosis: | 4.863 | Cond. No. | 5.71 |
ols_clf = LinearRegression(fit_intercept = True)
ols_clf.fit(df[['paid_campaign_binary', 'referral_google',
'referral_instagram', 'referral_facebook',
'user_agent_mac', 'user_agent_windows']], df['order_euros'])
LinearRegression()
ols_clf.predict([[0,0,0,0,0,0]])
array([6.42952853])
ols_clf.predict([[1,0,0,0,0,0]])
array([27.40951576])
ols_clf.predict([[0,0,1,0,0,0]])
array([-8.48317663])
ols_clf.predict([[1,1,0,0,1,0]])
array([142.12722111])
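The sklearn prediction can be cross-checked against the OLS summary above: for [1,1,0,0,1,0] (paid campaign, Google referral, Mac user agent), the fitted value is simply the intercept plus the three active coefficients:

```python
# Coefficients copied from the OLS summary above
const = 6.4295
paid_campaign_binary = 20.9800
referral_google = 71.4277
user_agent_mac = 43.2900

pred = const + paid_campaign_binary + referral_google + user_agent_mac
# pred is about 142.13 euros, matching ols_clf.predict([[1, 1, 0, 0, 1, 0]]) above
```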
data_lime_euros = df[['paid_campaign_binary', 'referral_google',
'referral_instagram', 'referral_facebook',
'user_agent_mac', 'user_agent_windows', 'order_euros']]
class_names_euros = data_lime_euros.columns[:-1]  # feature names only; exclude the DV 'order_euros'
X_data_lime_euros = data_lime_euros[['paid_campaign_binary', 'referral_google',
                                     'referral_instagram', 'referral_facebook',
                                     'user_agent_mac', 'user_agent_windows']].to_numpy()
y_data_lime_euros = data_lime_euros['order_euros'].to_numpy()
explainer = lime.lime_tabular.LimeTabularExplainer(
    X_data_lime_euros,
    feature_names=class_names_euros,
    class_names=['order_euros'],
    verbose=True,
    mode='regression',
    discretize_continuous=True)
exp = explainer.explain_instance(np.array([0,0,0,0,0,0]), ols_clf.predict)
exp.show_in_notebook(show_table=True)
Intercept 129.3470897147102
Prediction_local [6.47515026]
Right: 6.429528527853279
exp = explainer.explain_instance(np.array([1,0,0,0,0,0]), ols_clf.predict)
exp.show_in_notebook(show_table=True)
Intercept 108.4000656122036
Prediction_local [27.43146288]
Right: 27.409515755088165
As LIME is a framework still under development, we are not sure it will work on all computers and configurations. If you get an error message when running LIME, hand in the challenge anyway (showing the error message), and we will accept it as complete.